pip install git+https://github.com/tanghaibao/jcvi.git
conda install -c conda-forge numpy scipy matplotlib pandas networkx
conda install -c bioconda diamond samtools
conda install -c conda-forge imagemagick
conda install -c conda-forge tectonic
jcvi --version
conda install orthofinder -c bioconda
Reference genomes/proteomes in bold
gene synteny and collinearity - between species
Tang et al. (2024) JCVI: A Versatile Toolkit for Comparative Genomics Analysis. iMeta.
“Most notable features include the “compara” module for comparative genomics analysis, offering tools for synteny block reconstruction, gene loss cataloging, quota-based synteny alignment, pedigree visualization, and Ks calculations and visualizations, among others. …JCVI synteny inference is based on adaptive seeds via LAST and adopts rigorous filtering based on C-score and removal of proximal duplicates, thereby circumventing the difficulties with finding hard E-value cutoffs as well as avoiding artifacts from repeats.”
.last).last.filtered).anchors - seed
synteny blocks, high quality; .lifted.anchors - additional
anchors to form the final synteny blocks).bedkeeping one isoform per gene (parameter --primary_only)
is problematic, since it takes first listed
python -m jcvi.formats.gff bed --type=mRNA --key=Name X.gff3.gz -o X.bed;
or
python -m jcvi.formats.gff bed --type=mRNA --key=ID X.gff3 -o X.bed;
.CDS fastapython -m jcvi.formats.fasta format X.cds.fa.gz X.cds
python -m jcvi.compara.catalog ortholog plant1 plant2 --no_strip_names
python -m jcvi.graphics.dotplot plant1.plant2.anchors
python -m jcvi.compara.synteny depth --histogram plant1.plant2.anchors
python -m jcvi.compara.synteny screen --minspan=30 --simple plant1.plant2.anchors plant1.plant2.anchors.new
python -m jcvi.graphics.karyotype seqids layout
Comparative genomics and proteomics in Ensembl Sep 2006
Herrero et al. (2016). Ensembl comparative genomics resources. Database.
Ensembl Compara Schema Documentation
Ensembl canonical translation of each gene from species in Ensembl
HMM search on the TreeFam HMM library -
classify the sequences into their families
Cluster the genes that did not have any match into additional families
NCBI Blast+ (v2.2.28, parametrs: -seg no
-max_hsps_per_subject 1 -use_sw_tback) on
every orphaned gene against every other (both self and non-self
species)hcluster_sg (-C outgroup:
Saccharomyces cerevisiae, arguments: -m 750
-w 0 -s 0.34 -O)TreeFam clustering is done with outgroupsLarge families are broken down with QuickTree, limit
set to 1,500 genes
For each cluster (family), protein MSA M-Coffee
(v9.03.r1318, afftgins_msa, muscle_msa,
kalign_msa, t_coffee_msa) or
Mafft (large clusters, v7.113,
--auto)
For each aligned cluster, phylogenetic tree TreeBeST
using the CDS back-translation of the protein multiple alignment from
the original DNA sequences. A rooted tree with internal duplication tags
is obtained at this stage, reconciling it against a species tree
inferred from the NCBI taxonomy
ML) tree built, based on the
protein alignment with the WAG model, which takes into the account the
species tree (modified version of phyml release 2.4.5)phyml, based on the codon
alignment with the Hasegawa-Kishino-Yano (HKY) model, also
taking into account the species tree (modified version of phyml release
2.4.5)NJ) tree using
p-distance, based on the codon alignmentNJ tree using dN distance, based on the
codon alignmentNJ tree using dS distance, based on the
codon alignmentthe final tree is made by merging the five trees into one
consensus tree using the “tree merging” algorithm
branch lengths are estimated for the final consensus tree based
on the DNA alignment, using phyml with the HKY
model
From each gene tree, infer gene pairwise relations of orthology and paralogy types
A stable ID is assigned to each GeneTree
Orthologues
Homoeologues
Paralogues
Confidence scores
Jaccard index)Between species paralogues
duplication confidence
score ≤ 0.25Gene splits
1 MB)Confidence scoring
Orthology quality-controls
GOC score: 50WGA score: 50%identity: 25Van Bel et al. (2022) PLAZA 5.0: extending the scope and power of comparative and functional genomics in plants. Nucleic Acids Res.
DIAMOND combined with
Tribe-MCL (homologous families) &
OrthoFinder (sub-families) This way the quality of the
protein clustering can be inspected and can subgroups be detected.MAFFTFastTreeNotung 2.6 - root the trees and to infer speciation and
duplication events using the tree reconciliation mode and applying the
Duplication/Loss Score to evaluate alternate hypothesesPhyD3i-ADHoRe -
detects genomic homology based on the identification of conservation of
gene content and gene orderPAML - date colinear gene pairs based
on aligned coding sequences (CLUSTALW)between species
Majidian et al. (2025) Orthology inference at scale with FastOMA. Nat Methods.
” First, we leverage our current knowledge of the sequence
universe (with its evolutionary information stored in the OMA database)
to efficiently place new sequences into coarse-grained families
(hierarchical orthologous groups (HOGs) at the root level) using the
alignment-free k-mer-based OMAmer
tool.”
“In an attempt to detect homology among unplaced sequences
(which could belong to families that are absent from our reference
database), we then perform a round of clustering using the highly
scalable Linclust software.”
“Next, we resolve the nested structure of the HOGs
corresponding to each ancestor, in an efficient leaf-to-root
traversal of the species tree.”
“By avoiding sequence comparisons across different families, the number of computations is drastically reduced compared with conventional approaches.”
“Input proteomes are mapped to reference gene families using
the OMAmer software, forming hierarchical orthologous
groups (HOGs) at the root level (rootHOGs).”
“HOGs are inferred using a ‘bottom-up’ approach, starting from the leaves of the species tree and moving towards the root.”
“At each taxonomic level, HOGs from the child level are merged, resulting in HOGs at the current level.”
“To decide which HOGs should be merged, sequences from the
child HOGs are used to create a MSA (MAFFT), followed by
gene tree inference (FastTree2) to identify speciation and
duplication events.”
“Child HOGs are merged if their genes evolved through speciation.”
Input:
newick format, branch lengths
not needed.fa extension)
in the folder proteomeParameters:
Steps
OMAmer tool-ln (P value) of having as many or more k-mers
in common between the protein sequence and the HOG under a binomial
distribution -> default threshold = 70 -> graph of
rootHOGs -> add an edge between rootHOGs when a minimum of
10 proteins (default) are mapped to both query rootHOGs and
it represents at least either 80% of all proteins mapping
to the bigger rootHOG or 90% of those mapping to the
smallest oneLinclust (MMseqs package) -> new query rootHOGsMAFFT0.2FastTree20 proteins0 -> speciation event0.1 -> very low
support for a duplication eventmy_custom.config
params {
write_msas = true
report = true
write_genetrees = true
force_pairwise_ortholog_generation = true
filter_method = 'col-elbow-row-threshold'
filter_gap_ratio_row = 0.3
filter_gap_ratio_col = 0.3
nr_repr_per_hog = 12
min_sequence_length = 40
}
# in ./oma_path/
wget https://omabrowser.org/All/OmaServer.h5
wget https://omabrowser.org/All/speciestree.nwk
# NCBITaxonId | ParentTaxonId | Name
# 1437201 | 91827 | Pentapetalae
omamer mkdb --db Pentapetalae.h5 --oma_path ./oma_path/ --root_taxon "Pentapetalae" --nthreads 28 --log_level info
nextflow run ../FastOMA.nf -profile standard \
--input_folder ./input \
--output_folder output \
--omamer_db ./input/Viridiplantae.h5 \
--max_cpus 64 \
--fasta_header_id_transformer noop \
--with-report \
-c ./input/my_custom.config | tee run4.log
nextflow run ../FastOMA/FastOMA.nf -profile standard \
--input_folder ./input \
--output_folder output \
--omamer_db ./input/Pentapetalae.h5 \
--max_cpus 24 \
--fasta_header_id_transformer noop \
--with-report \
-c ./input/my_custom.config | tee run4_Pentapetalae.log
Kuznetsov et al. (2023). OrthoDB v11: annotation of orthologs in the widest sampling of organismal diversity. Nucleic Acids Research.
“In OrthoDB, we rely on the OrthoLoger software that
is configured to use MMseqs2 for homology searches, relies
on best-reciprocal-hits between each pair of species for identification
of candidate orthologs (as best-reciprocal-hit is a proxy for
reconciliation of the gene tree and a pair of species), and clusters
these candidates into OGs.”
Tegenfeldt et al. (2025). OrthoDB and BUSCO update: annotation of orthologs with wider sampling of genomes. Nucleic Acids Research.
“Evolutionary histories of multigene families of homologs can be
complex, and striving for more specificity generally results in more
over-splitting of OGs. In contrast, approaches like
OrthoMCL and OrthoFinder favor inclusivity,
increasing sensitivity but reducing specificity, leading to results that
encompass broader families of homologs. OrthoDB (https://www.orthodb.org)
is balancing sensitivity and specificity to optimize accuracy of
functional inferences.”
“OrthoLoger offers ab initio ortholog prediction and a hierarchical mode guided by a user-provided species tree/taxonomy that improves scalability and consistency across different levels of orthology. This hierarchical mode was used for the OrthoDB v12 update.”
OrthoLoger
for delineation of orthologs
ODB-mapper - map fasta file(s) to OrthoDB
orthologsorthologer - find orthologs in a set of fasta
filesdiamond or mmseqssegmasker from NCBI Blastcd-hitBRHCLUSeudicotyledonsEmms, D.M., Kelly, S. (2019). OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol.
“(a) orthogroup inference, (b) inference of gene trees for each orthogroup, (c and d) analysis of these gene trees to infer the rooted species tree, (e) rooting of the gene trees using the rooted species tree, and (f–h) duplication-loss-coalescence (DLC) analysis of the rooted gene trees to identify orthologs and gene duplication events (mapped to their locations in both the species and gene trees).”
DIAMOND or MMseqs2 (recommended, although
BLAST+ can be used instead)MCL graph clustering algorithmFastMEMAFFT
(recommended), MuscleFastTree (recommended),
IQTREE˙ (takes a very large amount of time to run on a
reasonable sized dataset), raxml#!/bin/bash
# Number of threads to use
THREADS=28
# File containing input/output directory pairs
PAIR_FILE="dir_pairs.txt"
# Log file with timestamp
LOGFILE="orthofinder_$(date '+%Y%m%d_%H%M%S').log"
# Function to print timestamped messages to screen and log
log_msg() {
local msg="$1"
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $msg" | tee -a "$LOGFILE"
}
log_msg "Starting OrthoFinder runs"
while read -r input_dir output_dir; do
log_msg "Running OrthoFinder for input: $input_dir, output: $output_dir"
# Run orthofinder, capture both stdout and stderr, tee to log
orthofinder -f "$input_dir" \
-M msa \
-S diamond_ultra_sens \
-T iqtree \
-t "$THREADS" \
-n orthofinder \
-o "$output_dir" 2>&1 | tee -a "$LOGFILE"
# Check exit status
if [ ${PIPESTATUS[0]} -ne 0 ]; then
log_msg "OrthoFinder failed for $input_dir"
else
log_msg "OrthoFinder completed for $input_dir"
fi
done < "$PAIR_FILE"
log_msg "All OrthoFinder runs completed"
Initialize an empty set or list covered_genes.
For each method in the list: ["MCScanX", "ensembl-compara", "PLAZA", "OrthoDB", "RBH", "FastOMA"]
For every row in the data table dt:
a. Check if criteria is "reject" for the row.
b. Check if the value of the column corresponding to method in this row is TRUE.
c. Check if from_geneID in this row are not in covered_genes.
d. If method is one of ["OrthoDB", "RBH", "FastOMA"]:
Check if gene pair is covered by all three methods:
If yes:
is_candidate = TRUE
new_criteria = "OrthoDB_FastOMA_RBH"
Else if gene pair is covered by any two of those three methods:
If yes:
is_candidate = TRUE
new_criteria = join the two method names with underscore, e.g. "OrthoDB_FastOMA" or "RBH_FastOMA" or "OrthoDB_RBH"
Else if MapMan4_Match string contains "match based on" and current method name:
is_candidate = TRUE
new_criteria = {method}_MapMan4 (e.g., "RBH_MapMan4")
Else:
is_candidate = FALSE
e. If all applicable conditions above are met, mark this row as a candidate for assignment.
After evaluating all rows for the current method:
a. If there are any candidate rows:
i. Assign the criteria of those candidate rows to the current method.
ii. Add all unique to_geneID and from_geneID values from those candidate rows to the covered_genes set to prevent reassignment.
Repeat for next method until all methods are processed.